These are code snippets and a short summary of the fourth chapter, Geocentric Models, of the book “Statistical Rethinking” (version 2) by Richard McElreath.
The chapter discusses linear models and starts with a recap of the normal distribution: why is it such a commonly used distribution, and how does it arise?
```r
library(rethinking)  # provides dens(), precis(), and the Howell1 data

# normal by addition: sums of many small fluctuations
pos <- replicate( 1000, sum( runif(16, -1, 1)))
dens(pos, norm.comp = TRUE)
```
Normal distributions also arise by multiplication, as long as the individual factors are small (small multiplicative effects are approximately additive):

```r
# normal by multiplication: products of small growth factors
growth <- replicate(1000, prod(1 + runif(12, 0, 0.1)))
dens(growth, norm.comp = TRUE)
```
This approximation fails if the growth factors are too large:
```r
big <- replicate(1000, prod(1 + runif(12, 0, 0.5)))
dens(big, norm.comp = TRUE)
```
The smaller the factor, the better the approximation:
```r
small <- replicate(1000, prod(1 + runif(12, 0, 0.0001)))
dens(small, norm.comp = TRUE)
```
Large multiplicative deviates are still approximately normal on the log scale (normal by log-multiplication):

```r
log.big <- replicate(1000, log( prod( 1 + runif(12, 0, 0.5))))
dens(log.big, norm.comp = TRUE)
```
There are broadly two reasons why Gaussian distributions are used in modelling: (1) ontological: we have reason to believe the process we’re modelling actually follows a Gaussian distribution, for the reasons laid out above. E.g. measurement errors tend to be Gaussian because, at heart, they arise from a process of adding up many small fluctuations. (2) epistemological: we don’t really know much about our process except that it has a mean and a variance, so we use a normal distribution because it makes the fewest additional assumptions; it is the maximum-entropy distribution for a given mean and variance. (If we know more, we should probably use a different distribution!)
For the example, we’ll be using the !Kung data to estimate height.
A short look at the data:
```r
# The !Kung data
data("Howell1")
d <- Howell1
str(d)
```
```
'data.frame': 544 obs. of 4 variables:
 $ height: num 152 140 137 157 145 ...
 $ weight: num 47.8 36.5 31.9 53 41.3 ...
 $ age   : num 63 63 65 41 51 35 32 27 19 54 ...
 $ male  : int 1 0 0 1 0 1 0 1 0 1 ...
```
```r
precis(d)
```
|        | mean        | sd         | 5.5%      | 94.5%     | histogram |
|---|---|---|---|---|---|
| height | 138.2635963 | 27.6024476 | 81.108550 | 165.73500 | ▁▁▁▁▁▁▁▂▁▇▇▅▁ |
| weight | 35.6106176  | 14.7191782 | 9.360721  | 54.50289  | ▁▂▃▂▂▂▂▅▇▇▃▂▁ |
| age    | 29.3443934  | 20.7468882 | 1.000000  | 66.13500  | ▇▅▅▃▅▂▂▁▁ |
| male   | 0.4724265   | 0.4996986  | 0.000000  | 1.00000   | ▇▁▁▁▁▁▁▁▁▇ |
Since height strongly correlates with age before adulthood, we’ll only use adults for the following analysis:
```r
d2 <- d[ d$age >= 18, ]
dens(d2$height)
```
The distribution of height then looks approximately Gaussian.
Our model looks as follows: \[\begin{align*} h_i &\sim \text{Normal}(\mu, \sigma)\\ \mu &\sim \text{Normal}(178, 20) \\ \sigma &\sim \text{Uniform}(0, 50) \end{align*}\]
To better understand the assumptions we’re making with our priors, let’s have a short look at them:
```r
curve( dnorm( x, 178, 20), from=100, to=250)
curve( dunif( x, 0, 50), from=-10, to=60)
```
To better understand what these priors mean, we can do a prior predictive simulation. That is, we sample from our priors and plug those samples into the likelihood to find out what the model thinks would be reasonable observations, just based on the priors, before having seen the data.
```r
sample_mu <- rnorm(1e4, mean=178, sd=20)
sample_sigma <- runif(1e4, 0, 50)
prior_height <- rnorm(1e4, mean=sample_mu, sd=sample_sigma)
dens(prior_height)
```
The model expects people to have a height between roughly 100 and 250 cm. A wide range, but it seems reasonable. Compare this with the prior predictive we would get from a nearly flat prior (changing the standard deviation of the \(\mu\) prior to 100):
```r
sample_mu <- rnorm(1e4, mean=178, sd=100)   # much wider prior on mu
prior_height <- rnorm(1e4, mean=sample_mu, sd=sample_sigma)
dens(prior_height)
```
The flat prior thinks people could have negative height (to the left of the dashed line) or be extremely tall (the line on the right indicates the height of one of the tallest persons ever recorded). Not very sensible.
For educational purposes, let’s do a grid approximation of our model.
Start with the grid:
```r
mu.list <- seq(from=150, to=160, length.out = 200)
sigma.list <- seq(from=7, to=9, length.out = 200)
post <- expand.grid(mu=mu.list, sigma=sigma.list)
```
For each value in the grid, compute the log-likelihood for the whole data:
```r
post$LL <- sapply( 1:nrow(post), function(i) sum( dnorm(
  d2$height,
  mean=post$mu[i],
  sd=post$sigma[i],
  log=TRUE
)))
```
Since we’re working with the log-likelihood, we sum the terms instead of multiplying them; multiplying hundreds of small densities would underflow to zero in floating point. In the same way, we add the log-prior densities:
```r
# add log prior to log likelihood (corresponds to multiplying on the natural scale)
post$prod <- post$LL +
  dnorm( post$mu, 178, 20, TRUE) +
  dunif( post$sigma, 0, 50, TRUE)
```
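To see why the log scale matters, here is a tiny standalone illustration (toy density values, unrelated to the height data): multiplying many small likelihoods underflows to zero in double precision, while summing their logs stays finite.

```r
# toy example: 200 density values of 0.01 each
p <- rep(1e-2, 200)
prod(p)       # the true product is 1e-400, which underflows to 0
sum(log(p))   # the log of the same product is finite: 200 * log(0.01), about -921
```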
Finally, we map back from the log scale to the probability scale, but first subtract the maximum to keep the exponentiated values from rounding to zero:
```r
post$prob <- exp( post$prod - max(post$prod))
```
Because of the scaling, the resulting values are actually not proper probabilities but relative probabilities.
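Relative probabilities are enough here: `sample()` normalizes its `prob` weights internally, and if proper probabilities were ever needed, one division restores them. A standalone sketch with toy values:

```r
# toy log-posterior values on a 3-point grid
logp <- c(-1, 0, -2)
rel  <- exp(logp - max(logp))   # relative probabilities, as in post$prob
prob <- rel / sum(rel)          # normalize to proper probabilities
sum(prob)                       # sums to 1
```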
We can visualize the distribution using a contour map:
```r
contour_xyz( post$mu, post$sigma, post$prob)
```
Or a heatmap:
```r
image_xyz( post$mu, post$sigma, post$prob)
```
The highest posterior density for \(\mu\) is somewhere around 155, with the highest density for \(\sigma\) around 8.
Instead of working with the grid posterior, we can use the grid to obtain a sample from our posterior:
```r
# sample parameter combinations from the grid, weighted by posterior probability
sample.rows <- sample( 1:nrow(post), size=1e4, replace=TRUE,
                       prob=post$prob)
sample.mu <- post$mu[ sample.rows ]
sample.sigma <- post$sigma[ sample.rows ]
```
We can then plot these samples instead:
```r
plot(sample.mu, sample.sigma, cex=0.5, pch=16, col=col.alpha(rangi2, 0.1))
```
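The samples can also be summarized numerically, e.g. with means and 89% percentile intervals. A base-R sketch with stand-in draws (in the actual analysis you would pass `sample.mu` and `sample.sigma` from above; the centers 155 and 8 are only illustrative, taken from where the grid posterior peaks):

```r
set.seed(1)
# stand-ins for sample.mu / sample.sigma from the grid above
sample.mu    <- rnorm(1e4, 155, 0.4)
sample.sigma <- rnorm(1e4, 8, 0.3)

mean(sample.mu)                           # posterior mean of mu
quantile(sample.mu,    c(0.055, 0.945))   # 89% percentile interval
quantile(sample.sigma, c(0.055, 0.945))
```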